Introduction

In this ACM CHI 2018 study, ā€œGender-Inclusive Design Sense of Belonging and Bias in Web Interfacesā€, the authors examine the perception of introductory Computer Science course sites, specifically the sense of community belonging by young women when shown websites coded as neutral and masculine (between subjects).

Participants review one of two course sites (either neutral or masculine conditions) and take a survey about the website afterwards, which includes both multiple choice and open response questions. The survey proper is preceded by verification questions to ensure that participants have reviewed the site, and concludes with a demographic survey.

The original paper and the repository

A link to the Qualtrics Survey is Here

A link to the preregistration of this replication: https://osf.io/rvw8t/

Methods

Power Analysis

The original paper reports an effect size for the ambient belonging measure as d = .62. Given a one-tailed independent means t-test, sample sizes to achieve 80%, 90%, and 95% power would be 66, 91, and 114 total participants, respectively. Fewer than 1% of participants were excluded in the original study.

Planned Sample

We planned to recruit 111(subject to power analysis) participants using Amazon Mechanical Turk using the same criteria as the original paper: age between 18 and 25 years of age and from the United States.

Materials

We reproduced the materials as described in the original paper and with materials provided by the original authors. The two webpages hosted will have the same material, but will be rehosted on a new domain. The original experiment features two websites (one neutral and one masculine): (Image and Caption from Original Paper) The banner and other content of the gender-neutral interface used nature imagery (left), while the masculine interface included Star Trek imagery and styling evocative of a computer terminal (right). (Image Taken from Original Paper)

Participants were asked a total of 22 questions across six measures of interest on 1-7 scales (1: ā€œnot at all, 7:ā€extremely"), which were 1. Enrollment Intention, 2. Ambient Belonging, 3. Anticipated Success, 4. Self Confidence, 5. Future CS Study Intentions, 6. Gender-related Anxiety.

Participants were also asked demographic questions about gender, age, race, and education afterwards.

Procedure

Participants are initially asked 3 text-based responses as an initial attention check and a fourth question about the author of the website ā€œto probe for suspicion about the study.ā€ Participants will be excluded upon failure of these questions.

From the original study: >Participants were asked to review one of the two course web- pages and answer survey questions about six main measures regarding their sense of ambient belonging, perceptions of the class and the discipline of computer-science, and gender- related anxiety. To prevent bias, participants were told that the study was intended to ā€œlearn more about young people’s attitudes towards studying Computer Science.ā€ After agreeing to participate, participants were redirected to a Qualtrics survey, which randomly assigned them to review one of the two course pages.

A coded open-ended response question was also asked ā€œWould you take this class? Why or why not?ā€ in the original study. This question will be asked in order to perform the replication as closely as possible but will not be analysed.

Otherwise, the procedure will be followed extremely closely, with the exception of the URLs for the two websites will be different, as they will be newly hosted by the replication author.

Analysis Plan

As per the original study, responses will be excluded based on a failure of the original attention check questions.

The responses will be analysed for statistical significance across the six measures (means of related questions) across women in the masculine condition in contrast with women in the neutral condition in addition to women in the masculine condition compared to all other groups. We use a t-test to indicate statistical significance between the two groups.

Clarify key analysis of interest here The key analysis are the measures for ambient belonging and gender-related anxiety of women in the masculine condition compared to all other groups. Statistically significant difference between the two groups is expressed in a t-test.

Differences from Original Study

The only differences in the study are the absence of the open-response question in the analysis (but not the survey itself) and the different URLs of the websites. We do not predict that the different URL will have a significant effect, as participants will access the website via the same hyperlinked text. The original URLs were cs.stanford.edu/cs106a/(home/course), while the new URLs will be stanford.edu/~erawn/cs106a/(home/course).

Methods Addendum (Post Data Collection)

Actual Sample

13 Participants were excluded due to failing the attention check questions. The data collection was otherwise as described.

Results

Data preparation

Participants will be excluded based on the attention check questions. The mean score for each target measure will be calculated from the relevant questions.

###Data Preparation

####Load Relevant Libraries and Functions
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(forcats)
####Import data
data = read.csv("data/finalData.csv")
turk = read.csv("data/FinalTurk.csv")
#### Data exclusion / filtering
data_filtered = data[-1:-2,]
data_filtered = data_filtered %>% filter(Finished = TRUE)
data_filtered = data_filtered %>% filter(Random.ID %in% turk$Answer.surveycode)

##Exclude those who failed the attention check
data_filtered = data_filtered[-c(10,11,15,16,22,24,41,44,45,47,50,54,60),]

#Select Fields of Interest
data_questions = data_filtered %>%
  select(ResponseId,
         starts_with("EI"),
         starts_with("AB"),
         #starts_with("AS"),
         #starts_with("FN"),
         #starts_with("CS"),
         starts_with("PC"),
         starts_with("LT"),
         #starts_with("SN"),
         starts_with("MF"),
         starts_with("GS"),
         text_condition,
         starts_with("Q18"),
         starts_with("Q52"),
         starts_with("Q53"))

#### Prepare data for analysis - create columns etc.
data_longer = data_questions %>% pivot_longer(cols=-c("Q18", "text_condition", "ResponseId"),
                             names_to = 'Question',
                             values_to = 'Value')

#Handle Exploratory Data fields 
data_longer$Question[data_longer$Question == "Q53"] <- "COURSES"
data_longer$Question[data_longer$Question == "Q52_1"] <- "CSEXPERIENCE"

#Rename Data Fields
data_longer = data_longer %>% 
  mutate(
    Category = str_extract(Question, "[A-Z]+")
  )
data_longer = data_longer %>%
  rename(
    Gender = Q18,
    Treatment = text_condition
  )

data_longer$Gender[data_longer$Gender == "1"] <- "Male"
data_longer$Gender[data_longer$Gender == "2"] <- "Female"
data_longer$Gender[data_longer$Gender == "3"] <- "Other"
data_longer$Category[data_longer$Category == "EI"] <- "Enrollment Intentions"
data_longer$Category[data_longer$Category == "AB"] <- "Ambient Belonging"
data_longer$Category[data_longer$Category == "MF"] <- "Perceived Website Masculinity"
data_longer$Category[data_longer$Category == "PC"] <- "Self Confidence"
data_longer$Category[data_longer$Category == "LT"] <- "Future CS Study Intentions"
data_longer$Category[data_longer$Category == "GS"] <- "Gender-related Anxiety"
data_longer$Treatment[data_longer$Treatment == "1"] <- "Neutral Condition"
data_longer$Treatment[data_longer$Treatment == "2"] <- "Masculine Condition"

data_summary = data_longer %>%
  group_by(ResponseId, Category, Gender, Treatment) %>%
  summarize(MeanQuestion = mean(as.numeric(Value), na.rm=T))
## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion
## `summarise()` regrouping output by 'ResponseId', 'Category', 'Gender' (override with `.groups` argument)
data_stats = data_longer %>%
  group_by(Category, Gender, Treatment) %>%
  summarize(MeanQuestion = mean(as.numeric(Value), na.rm=T),
            StandardDeviation = sd(as.numeric(Value),na.rm = T),
            n = n(),
            StandardError = StandardDeviation / sqrt(n),
            CI = StandardError * 1.96
            )
## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion

## Warning in mean(as.numeric(Value), na.rm = T): NAs introduced by coercion
## Warning in is.data.frame(x): NAs introduced by coercion

## Warning in is.data.frame(x): NAs introduced by coercion

## Warning in is.data.frame(x): NAs introduced by coercion

## Warning in is.data.frame(x): NAs introduced by coercion

## Warning in is.data.frame(x): NAs introduced by coercion
## `summarise()` regrouping output by 'Category', 'Gender' (override with `.groups` argument)

Confirmatory analysis

As specified, we calculate the mean score for each target measure than run a t-test comparing women to men for each condition. Then, run a second analysis comparing women in the masculine condition to all other groups.

#Method For Running T-Tests
ttest <- function(category){
  ab = data_summary %>%
  filter(Category == toString(category))
ab_male_1 = ab %>%
  filter(Gender == "Male",Treatment == "Neutral Condition") 
ab_male_2 = ab %>%
  filter(Gender == "Male",Treatment == "Masculine Condition") 
ab_female_1 = ab %>%
  filter(Gender == "Female",Treatment == "Neutral Condition") 
ab_female_2 = ab %>%
  filter(Gender == "Female",Treatment == "Masculine Condition") 
ab_other <- rbind(ab_male_1,ab_male_2,ab_female_1)

result <- t.test(ab_other["MeanQuestion"],ab_female_2["MeanQuestion"])
return(result)
}

#Run T-Tests on These Categories
ttests <- data.frame("Category" = c("Enrollment Intentions", "Ambient Belonging", "Self Confidence","Future CS Study Intentions","Gender-related Anxiety","Perceived Website Masculinity"), "p_val", "t_val" )

#Run Tests
for(i in 1:nrow(ttests)){
  ttests[i,2] = ttest(ttests[i,1])$p.value
  ttests[i,3] = ttest(ttests[i,1])$statistic
}

#Display Significant Results (if any)
ttests %>% filter(
  X.p_val. < .05
)
##            Category           X.p_val.         X.t_val.
## 1 Ambient Belonging 0.0165900122738623 2.68543155969075
## 2   Self Confidence 0.0479333934882417 2.14235353392181
#remove non-binary gender respondents (consistent with original study)
data_binary <- data_stats %>%
  filter(Gender == "Male" || Gender == "Female")

#Produce Chart
data_binary$Category <- factor(data_binary$Category, levels=c("Enrollment Intentions", "Ambient Belonging", "Self Confidence","Future CS Study Intentions","Gender-related Anxiety","Perceived Website Masculinity"))
data_binary$Treatment <- factor(data_binary$Treatment, levels=c("Neutral Condition", "Masculine Condition"))

p <- ggplot(data_binary, aes(y=MeanQuestion,x=Gender,fill=Treatment)) + 
  geom_bar(position="dodge", stat="identity") +
  geom_errorbar(
    aes(ymin = MeanQuestion - CI, ymax = MeanQuestion + CI),
    width=.2, position = position_dodge(0.9)) +
  scale_fill_brewer(palette="Set1") +
  ylab("Mean Question Response") +
  facet_wrap(~Category) + 
  scale_fill_manual(values=c("Green", "Blue"),limits= c("Neutral Condition","Masculine Condition")) +
  scale_x_discrete(limits=c("Male","Female"))+
  scale_y_discrete(breaks = 0:8,limits = 0:7)
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p

Results Overview

Overview of Results(Taken from the Original Paper)

Replicated Graph Replication Final Results While the replicated graph shows that each of the six categories is trending roughly in the same direction as the original study, only the measures of Ambient Belonging t(50) = 2.68 p = .017 and Self Confidence t(50) = 2.14 p = .048 achieved statistical significance.

Exploratory analyses

experience <- data_summary$MeanQuestion[data_summary$Category == "CSEXPERIENCE"]
ambientb <- data_summary$MeanQuestion[data_summary$Category == "Ambient Belonging"]

df = data.frame(experience,ambientb)
cor.test(experience,ambientb,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  experience and ambientb
## t = 3.4972, df = 48, p-value = 0.001024
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1969766 0.6477227
## sample estimates:
##       cor 
## 0.4506179
ggplot(df, aes(x=experience, y=ambientb)) +
  geom_smooth(method="lm") +
  geom_point() +
  ylab("Ambient Belonging Measure") +
  xlab("CS Experience") +
  scale_x_discrete(breaks = 0:10,limits = 0:10)
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
## `geom_smooth()` using formula 'y ~ x'

  ggtitle("CS Experience and Ambient Belonging")
## $title
## [1] "CS Experience and Ambient Belonging"
## 
## attr(,"class")
## [1] "labels"

Added for this replication was a demographic question at the end of the survey which asked: ā€œHow would you rate your Computer Science Experience?ā€ Here we see that computer science experience positivey correlates with the measure of ambient belonging, which would be consistent with the theoretical basis of the study— those who already had taken Computer Science classes more easily saw themselves as members of a computer science ā€œin groupā€.

courses <- data_summary$MeanQuestion[data_summary$Category == "COURSES"]
ambientb <- data_summary$MeanQuestion[data_summary$Category == "Ambient Belonging"]

df = data.frame(courses,ambientb)
cor.test(courses,ambientb,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  courses and ambientb
## t = 1.7863, df = 48, p-value = 0.08036
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03082017  0.49370699
## sample estimates:
##       cor 
## 0.2496694
ggplot(df, aes(x=courses, y=ambientb)) +
  geom_smooth(method="lm") +
  geom_point() +
  ylab("Ambient Belonging Measure") +
  xlab("CS Courses Taken") +
  scale_x_discrete(breaks = 0:10,limits = 0:10)
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
## `geom_smooth()` using formula 'y ~ x'

  ggtitle("CS Courses and Ambient Belonging")
## $title
## [1] "CS Courses and Ambient Belonging"
## 
## attr(,"class")
## [1] "labels"

The number of Computer Science classes, however, was only weakly(and non significantly) correlated with ambient belonging. It seems meaningful that all of the participants which had taken >3 Computer Science classes all had relatively high Ambient Belonging scores, perhaps suggesting that only after a couple classes do computer science classes contribute to a feeling of ambient belonging. This finding is only meaningful insofar as it suggests a topic for future study.

#T-tests are run on the M
ttest <- function(category){
  ab = data_summary %>%
  filter(Category == toString(category))
ab_male_1 = ab %>%
  filter(Gender == "Male") 
ab_female_1 = ab %>%
  filter(Gender == "Female") 

result <- t.test(ab_male_1["MeanQuestion"],ab_female_1["MeanQuestion"])
return(result)
}

ttests <- data.frame("Category" = c("Enrollment Intentions", "Ambient Belonging", "Self Confidence","Future CS Study Intentions","Gender-related Anxiety","Perceived Website Masculinity"), "p_val", "t_val" )

for(i in 1:nrow(ttests)){
  ttests[i,2] = ttest(ttests[i,1])$p.value
  ttests[i,3] = ttest(ttests[i,1])$statistic
}

ttests %>% filter(
  X.p_val. < .05
)
##                     Category            X.p_val.         X.t_val.
## 1          Ambient Belonging 0.00369647166784775 3.15123602140597
## 2            Self Confidence 0.00821748584989376 2.78135375538417
## 3 Future CS Study Intentions  0.0308344853680487 2.25327379721226

Lastly, we look for significant differences between men and women regardless of the treatment condition. Here we see significant differences between men and women in their Ambient Belonging, Self Confidence, and Future CS Study Intentions.

#T-tests are run on the M
ttest <- function(category){
  ab = data_summary %>%
  filter(Category == toString(category))
ab_female_1 = ab %>%
  filter(Gender == "Female",Treatment == "Neutral Condition") 
ab_female_2 = ab %>%
  filter(Gender == "Female",Treatment == "Masculine Condition") 

result <- t.test(ab_female_1["MeanQuestion"],ab_female_2["MeanQuestion"])
return(result)
}

ttests <- data.frame("Category" = c("Enrollment Intentions", "Ambient Belonging", "Self Confidence","Future CS Study Intentions","Gender-related Anxiety","Perceived Website Masculinity"), "p_val", "t_val" )

for(i in 1:nrow(ttests)){
  ttests[i,2] = ttest(ttests[i,1])$p.value
  ttests[i,3] = ttest(ttests[i,1])$statistic
}

ttests
##                        Category          X.p_val.           X.t_val.
## 1         Enrollment Intentions 0.645135239322625 -0.469180033828501
## 2             Ambient Belonging 0.507941278408974  0.678689955731923
## 3               Self Confidence 0.602120197444383  0.531060329464502
## 4    Future CS Study Intentions 0.883906234562075  0.148471307827121
## 5        Gender-related Anxiety 0.829492974431829 -0.218599627850714
## 6 Perceived Website Masculinity 0.298023573957291  -1.07935509156326

Finding above that there are no significant differences between women in the two conditions combined with the above analysis that showed that significant differences existed due to gender regardless of the condition might suggest a significant impact of gender regardless of condition on the central statistic of the paper, which compared women in the masculine condition to all other conditions. This would suggest that there exists serious effects due to stereotypes on women when shown computer science material, but that those affects occur regardless of the specific aesthetic or stereotype cues of the content itself.

Discussion

Summary of Replication Attempt

This replication was partially successful in replicating the results of the original paper, as it successfully found significant differences between women in the masculine condition and all other groups on the categories of ambient belonging and self confidence, but failed on all other categories.

Commentary

The exploratory analysis suggests future work analyzing the effect of previous CS experience on feelings of ambient belonging. It also revealed significant effects due to gender regardless of the treatment condition. These effects should have been isolated in the original paper by comparing only women between the masculine and neutral conditions. If no significant differences were found, then it seems possible that the significant differences between women in the masculine condition and all other groups was due to gender regardless of the treatment, which would call into question the findings of the paper and would suggest that women on the whole experience greater resistance to ambient belonging cues related to computer science regardless of the specific aesthetic of the material.

This replication had a much smaller sample size compared to the original study, and so it remains possible that this study was simply underpowered.

Other than the added demographic questions at the end and the different URLs of the test websites (which were not mentioned in the suspicion probe question), this replication was identical to the original study.